Email Classification

Introduction

The dataset used is the Enron email dataset, a collection of public-domain emails from the Enron Corporation that have been manually labelled as spam or ham (non-spam). The objective is to build a supervised classification pipeline that classifies emails as spam or non-spam from the training data. We will compare various supervised classification models, compute the accuracy of each, and thereby select the most accurate model.

Various Steps involved:

  1. Preprocessing: Training and test splits, Feature extraction
  2. Exploratory data analysis
  3. Supervised classification: Model selection, Model evaluation

Importing the required libraries

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import plotly.express as plt   # note: plotly.express is aliased as plt here
import matplotlib.pyplot as py # note: matplotlib.pyplot is aliased as py here
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import plot_roc_curve
from sklearn.model_selection import cross_val_score

nltk.download('stopwords')

# installing plotly package
#!pip install plotly
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dalal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Out[1]:
True
  1. numpy: To perform scientific computing and data manipulation on multi-dimensional arrays and matrices
  2. pandas: Used for data manipulation and analysis
  3. re: Provides regular expression matching operations to remove special and unwanted characters, such as "<", ">", etc., which carry no meaning
  4. nltk: Used for symbolic and statistical natural language processing of English text, such as removing stop words and tokenizing words
  5. matplotlib and plotly: Used for data visualisation
  6. sklearn: Scikit-learn is used for implementing machine learning algorithms, extracting features and computing various scores of the predicted models

Loading Dataset

In [2]:
enron_data = load_files("C:\\Users\\dalal\\Desktop\\CIT Modules + Lectures + Materials\\- Sem 2\\Applied Machine Learning - COMP9060_26651\\Project 1\\enron1")

print ("Total number of Emails loaded: %d emails" % len(enron_data.filenames))
print("Categories Loaded: ",enron_data.target_names)
Total number of Emails loaded: 5172 emails
Categories Loaded:  ['ham', 'spam']

With the help of the 'load_files' function from the sklearn library [1], text files containing individual emails, with the categories 'ham' and 'spam' as subfolder names, are loaded. The 'load_files' function outputs 'data' and 'target'.

There are 5172 emails in total across both folders, and the categories are 'ham' and 'spam'.
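The folder layout 'load_files' expects can be sketched with a throwaway directory (the temporary path and file names below are illustrative, not the actual dataset): one subfolder per category, one text file per email.

```python
import pathlib
import tempfile

from sklearn.datasets import load_files

# Build a tiny illustrative directory: one subfolder per category.
root = pathlib.Path(tempfile.mkdtemp())
(root / "ham").mkdir()
(root / "spam").mkdir()
(root / "ham" / "1.txt").write_text("Subject: meeting at 10am")
(root / "spam" / "1.txt").write_text("Subject: cheap meds !!!")

emails = load_files(str(root))
print(emails.target_names)    # subfolder names become the category names
print(len(emails.filenames))  # one entry per file loaded
```

The subfolder names become 'target_names' (sorted alphabetically), and 'target' holds the matching integer label for each file in 'data'.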

In [3]:
# Creating two lists to store the emails and category of the email (ham or spam)
email_message, email_category = [], []
email_message = np.append(email_message, enron_data.data)
email_category = np.append(email_category, enron_data.target) 

# using a named dictionary to avoid shadowing the built-in 'dict'
email_dict = {'Email Body' : email_message,'Email Category' : email_category}
email_dataframe = pd.DataFrame(email_dict)

email_dataframe["Email Body"] = email_dataframe["Email Body"].astype(str)  
email_dataframe["Email Category"] = email_dataframe["Email Category"].astype(str)

Creating the two lists 'email_message' to store the emails and 'email_category' to store the category of each email (ham or spam) using the numpy library. The 'target' variable holds the category as a float, where 0.0 is ham and 1.0 is spam.

The 'email_dataframe' Data Frame is created by adding these lists to a dictionary using the pandas library.

Since the 'Email Category' Series holds float values (and 'Email Body' holds bytes), we convert both to the string data type in order to perform string operations.

In [4]:
# Changing the Email Category value 0.0 to ham and 1.0 to spam
email_dataframe["Email Category"] = email_dataframe["Email Category"].replace({"0.0":"ham","1.0":"spam"})

print(email_dataframe.describe())
email_dataframe.head()
                                               Email Body Email Category
count                                                5172           5172
unique                                               4994              2
top     b'Subject: calpine daily gas nomination\r\n>\r...            ham
freq                                                   20           3672
Out[4]:
Email Body Email Category
0 b'Subject: nesa / hea \' s 24 th annual meetin... ham
1 b'Subject: meter 1431 - nov 1999\r\ndaren -\r\... ham
2 b"Subject: investor here .\r\nfrom : mr . rich... spam
3 b"Subject: hi paliourg all available meds . av... spam
4 b'Subject: january nominations at shell deer p... ham

Changing the 'Email Category' values 0.0 to ham and 1.0 to spam so the categories display clearly.

Of the 5172 emails, only 4994 are unique, so the dataset contains duplicate emails.

Cleaning the Data

Dropping Duplicates and checking if NaNs are present

In [5]:
# Removing the duplicate emails
email_dataframe = email_dataframe.drop_duplicates()
email_dataframe.shape
Out[5]:
(4994, 2)

After removing the duplicates, there are 4994 emails which are unique.

In [6]:
# Checking if any null, NaN or NA values are present that need to be removed
email_dataframe["Email Body"].isnull().sum()
Out[6]:
0

There are no null, NaN or NA values present that need to be removed.

Email Cleaning

In [7]:
# Removing special characters like !, @, ., %, etc. and keeping only letters

clean_email_no_special = []

for email in email_dataframe["Email Body"]:
    string = email
    
    #converting the words to lower case
    string = string.lower()
    
    #removing the leading 'b' left over from the bytes-literal prefix
    string = re.sub(r'^b+', '', string) 
    
    #list of substrings (including the word 'subject') to be removed
    special_char = ['subject','\n','\\n','\t','\\r','\r','\'','"',':',';','.','/','!','#','$','%','{','}','(',')','?','&',']','[','-','_']
    for char in special_char:
        string = string.replace(char,' ')
   
    string = re.sub("[^A-Za-z0-9]+", ' ', str(string))
    
    # removing numbers from the string
    pattern = '[0-9]'
    string = re.sub(pattern, '', string)
    
    #Removing single characters in the string
    string = ' '.join( [w for w in string.split() if len(w)>1] )
    
    clean_email_no_special.append(string)

Reading each email from the Data Frame, we convert every word to lower case.
Every email starts with the character 'b' (the bytes-literal prefix), which is removed.
A list of substrings such as '\n', '\t', '\r', '\'', '"', ':', ';', '.', '/', '!', '#', '$', etc. is removed, along with the word 'subject', which appears in every email.
We also remove numbers and single characters from the string.
The 'clean_email_no_special' list is obtained after cleaning.

An example of an email before and after cleaning:

Before Cleaning:

In [8]:
print([email_dataframe["Email Body"][10]])
["b'Subject: rates\\r\\ndaren -\\r\\nrates for september :\\r\\nt - ville interconnects to equistar channelview $ . 15\\r\\nagua dulce interconnects to equistar channelview $ . 13\\r\\nlet me know if you get something done or need quotes on something else ! term rate on # 6461 to follow .'"]

After Cleaning:

In [9]:
print(clean_email_no_special[10])
rates daren rates for september ville interconnects to equistar channelview agua dulce interconnects to equistar channelview let me know if you get something done or need quotes on something else term rate on to follow

Removing the stop words

In [10]:
stop_words = set(stopwords.words('english')) 

clean_email_no_stopwords = []
for i in range(len(clean_email_no_special)):
    word_tokens = word_tokenize(clean_email_no_special[i]) 
    words = [w for w in word_tokens if not w in stop_words] 
    clean_email_no_stopwords.append(words)

In order to remove the stop words (such as “the”, “a”, “an”, “in”), we import them from the corpus module of the 'nltk' library. To remove stop words from an email, we first split the text into words and then drop each word that exists in the NLTK stop-word list. The 'word_tokenize' function is used to do that.

An example of an email before and after removing the stop words:

Before removing the stop words:

In [11]:
print(clean_email_no_special[1])
meter nov daren could you please resolve this issue for howard will be out of the office the next two days when this is done please let george know thanks aimee forwarded by aimee lannou hou ect on pm howard camp pm to aimee lannou hou ect ect cc daren farmer hou ect ect stacey neuweiler hou ect ect mary smith hou ect ect meter nov aimee sitara deal for meter has expired on oct settlements is unable to draft an invoice for this deal this deal either needs to be extended or new deal needs to be set up please let me know when this is resolved we need it resolved by friday dec hc

After removing the stop words:

In [12]:
print(clean_email_no_stopwords[1])
['meter', 'nov', 'daren', 'could', 'please', 'resolve', 'issue', 'howard', 'office', 'next', 'two', 'days', 'done', 'please', 'let', 'george', 'know', 'thanks', 'aimee', 'forwarded', 'aimee', 'lannou', 'hou', 'ect', 'pm', 'howard', 'camp', 'pm', 'aimee', 'lannou', 'hou', 'ect', 'ect', 'cc', 'daren', 'farmer', 'hou', 'ect', 'ect', 'stacey', 'neuweiler', 'hou', 'ect', 'ect', 'mary', 'smith', 'hou', 'ect', 'ect', 'meter', 'nov', 'aimee', 'sitara', 'deal', 'meter', 'expired', 'oct', 'settlements', 'unable', 'draft', 'invoice', 'deal', 'deal', 'either', 'needs', 'extended', 'new', 'deal', 'needs', 'set', 'please', 'let', 'know', 'resolved', 'need', 'resolved', 'friday', 'dec', 'hc']
In [13]:
clean_email = []
for lis in clean_email_no_stopwords:
  clean_email.append(' '.join(word for word in lis))

clean_email[1]
Out[13]:
'meter nov daren could please resolve issue howard office next two days done please let george know thanks aimee forwarded aimee lannou hou ect pm howard camp pm aimee lannou hou ect ect cc daren farmer hou ect ect stacey neuweiler hou ect ect mary smith hou ect ect meter nov aimee sitara deal meter expired oct settlements unable draft invoice deal deal either needs extended new deal needs set please let know resolved need resolved friday dec hc'

Joining the tokenized words back to form a string for an email. 'clean_email' contains the list of emails after cleaning and removing the stop words.

In [14]:
email_dataframe["Email Category"] = email_dataframe["Email Category"].replace({"ham":0,"spam": 1})

Splitting the Dataset into Train and Test data

In [15]:
# Splits the data into train and test dataset in a ratio of 70:30
features = clean_email
classes = email_dataframe["Email Category"]
    
email_train, email_test, class_train, class_test = train_test_split(features, classes, train_size=0.7, test_size=0.3, shuffle=True)

print("Emails in the Training set: %d emails" % len(email_train))
print("Emails in the Test set: %d emails" % len(email_test))
Emails in the Training set: 3495 emails
Emails in the Test set: 1499 emails

Splitting the data into training and test sets in a ratio of 70:30 gives 3495 emails in the training set and 1499 emails in the test set.

In [16]:
# Splits the training data into train and validation datasets in a ratio of 80:20 in order to carry out cross validation on the Validation set
email_train, email_val, class_train, class_val = train_test_split(email_train, class_train, train_size=0.8, test_size=0.2, shuffle=True)

print("Emails in the Training set: %d emails" % len(email_train))
print("Emails in the Validation set: %d emails" % len(email_val))
Emails in the Training set: 2796 emails
Emails in the Validation set: 699 emails

Splitting the training data into train and validation sets in a ratio of 80:20 (in order to carry out cross validation on the Validation set) gives 2796 emails in the training set and 699 emails in the validation set.

In [17]:
class_train.value_counts()
Out[17]:
0    1976
1     820
Name: Email Category, dtype: int64

There are 1976 emails in the 'ham' category and 820 emails in the 'spam' category in the Training set.

In [18]:
# barplot of frequency of ham and spam in train data 
train = pd.DataFrame(data=class_train)
train["Email Category"] = train["Email Category"].replace({0:"ham",1:"spam"})
barplot_train = train['Email Category'].value_counts().plot(kind='bar',
                                                            title="Counts of ham and spam emails in Training data")
barplot_train.set_xlabel("Email Category")
barplot_train.set_ylabel("Count")
py.show()

The above figure visualises the number of emails present in the Training set for the ham and spam categories. There are 1976 emails in the 'ham' category and 820 in the 'spam' category.

In [19]:
class_test.value_counts()
Out[19]:
0    1065
1     434
Name: Email Category, dtype: int64

There are 1065 emails in the 'ham' category and 434 emails in the 'spam' category in the Test set.

In [20]:
# barplot of frequency of ham and spam in test data 
test = pd.DataFrame(data=class_test)
test["Email Category"] = test["Email Category"].replace({0:"ham",1:"spam"})
barplot_train = test['Email Category'].value_counts().plot(kind='bar',
                                                           title="Counts of ham and spam emails in Test data",
                                                           color="green")
barplot_train.set_xlabel("Email Category")
barplot_train.set_ylabel("Count")
py.show()

The above figure visualises the number of emails present in the Test set for the ham and spam categories. There are 1065 emails in the 'ham' category and 434 in the 'spam' category.

Exploratory Data Analysis on the Training Set

In [21]:
#creating a dictionary for the Training Set
train_set_dict = {'Email Body' : email_train,'Email Category' : class_train}
train_set = pd.DataFrame(train_set_dict)
train_set["Email Category"] = train_set["Email Category"].replace({0:'ham',1: 'spam'})
train_set.head()
Out[21]:
Email Body Email Category
2110 fw fw march invoice question cut dry however m... ham
4945 nom actual flow agree forwarded melissa jones ... ham
2625 vance list meters sitara well head portfolio s... ham
441 toyota fuel guidelines starting november th ge... spam
4071 business ideas march hows going looking new wa... spam

Top 20 Most Frequently Used Words in Ham Emails

Creating a subset with only ham type values

In [22]:
# creating a subset with only ham type values
ham = train_set.loc[(train_set["Email Category"] == 'ham')]

ham.head()
Out[22]:
Email Body Email Category
2110 fw fw march invoice question cut dry however m... ham
4945 nom actual flow agree forwarded melissa jones ... ham
2625 vance list meters sitara well head portfolio s... ham
2130 gas flow show nom teco tap actual teco tap for... ham
3952 pricing issue production duke energy probably ... ham
In [23]:
# concatenating all rows of the data set into one string 

combined_ham_emails = ham["Email Body"].str.cat(sep=' ')
combined_ham_emails = combined_ham_emails.replace(',',' ')
 
# using word_tokenize to count the frequency of each word
words = nltk.tokenize.word_tokenize(combined_ham_emails)
word_dist = nltk.FreqDist(words)

# storing the data into dataframe
ham_frequent_words = pd.DataFrame(word_dist.most_common(20),
                    columns=['Word', 'Frequency'])

print("Top 20 Most Frequently Used Words in Ham Emails:")
print(ham_frequent_words)
Top 20 Most Frequently Used Words in Ham Emails:
         Word  Frequency
0         ect       7650
1         hou       3986
2       enron       3501
3         com       1756
4         gas       1590
5        deal       1498
6      please       1494
7       meter       1344
8          cc       1298
9          pm       1253
10        hpl       1247
11     thanks        988
12      daren        984
13       corp        922
14       know        817
15      mmbtu        761
16  forwarded        693
17       need        683
18        let        598
19     farmer        591

To find the most frequent words in the ham emails, we first combine all the emails into one string. Then, using 'word_tokenize' to count the frequency of each word, we store the counts in the dataframe 'ham_frequent_words'.

In [24]:
# Barplot of most frequent words used in ham emails

fig = plt.bar(ham_frequent_words, x='Word', y='Frequency', color='Frequency',
             labels={'Word':'Most frequently used words in ham','Frequency':'Frequency of each word'},height=500)
fig.show(renderer="notebook")

The above figure shows a bar chart [2] of the 20 most frequently used words in ham emails.

In order to render this dynamic plot in the HTML document, we have used 'fig.show(renderer="notebook")'.

Top 20 Most Frequently Used Words in Spam Emails

Creating a subset with only spam type values

In [25]:
# creating a subset with only spam type values
spam = train_set.loc[(train_set["Email Category"] == 'spam')]

spam.head()
Out[25]:
Email Body Email Category
441 toyota fuel guidelines starting november th ge... spam
4071 business ideas march hows going looking new wa... spam
733 paliourg udtih wcwknoanopkt good morning palio... spam
4072 get real results real agra make want ever supe... spam
3378 make computer like new remove spyware home com... spam
In [26]:
# concatenating all rows of the data set into one string 

combined_spam_emails = spam["Email Body"].str.cat(sep=' ')
combined_spam_emails = combined_spam_emails.replace(',',' ')
 
# using word_tokenize to count the frequency of each word
words = nltk.tokenize.word_tokenize(combined_spam_emails)
word_dist = nltk.FreqDist(words)

# storing the data into dataframe
spam_frequent_words = pd.DataFrame(word_dist.most_common(20),
                    columns=['Word', 'Frequency'])

print("Top 20 Most Frequently Used Words in Spam Emails:")
print(spam_frequent_words)
Top 20 Most Frequently Used Words in Spam Emails:
           Word  Frequency
0           com        539
1          http        522
2       company        387
3           www        318
4          nbsp        310
5        please        270
6            us        261
7         email        259
8   information        258
9           get        257
10          new        243
11        price        237
12          one        220
13   statements        219
14          may        217
15        pills        207
16         time        198
17          inc        182
18        stock        172
19        money        169
In [27]:
# Barplot of most frequent words used in spam emails

fig = plt.bar(spam_frequent_words, x='Word', y='Frequency', color='Frequency',
             labels={'Word':'Most frequently used words in spam','Frequency':'Frequency of each word'},height=500)
fig.show()

The above figure shows a bar chart of the 20 most frequently used words in spam emails.

Boxplot for comparison of the distribution of email lengths in ham and spam emails

In [28]:
# creating a copy of the dataset and calculating the length (in characters) of each row of the 'Email Body' column
train_set_copy = train_set.copy()
train_set_copy['Word Count'] = train_set_copy['Email Body'].apply(len)

# Describing the Email Categories
print(train_set_copy.groupby('Email Category').describe())

# boxplot of distribution of email Category ham and spam
fig = plt.box(train_set_copy, x="Email Category",y="Word Count",
            labels={'Email Category':'Type of email','Word Count':'Total length in characters'})
fig.show()
               Word Count                                               \
                    count        mean         std   min     25%    50%   
Email Category                                                           
ham                1976.0  564.255061   891.54285  11.0  133.00  285.0   
spam                820.0  883.429268  1471.88486   0.0  190.75  378.0   

                                 
                   75%      max  
Email Category                   
ham             685.50  20401.0  
spam            851.25  21432.0  

Spam emails are usually longer than ham emails, since the median length is greater for spam. The mean values confirm the same. The median length for ham emails is 285 characters, while for spam it is 378.

The maximum email lengths for ham and spam are very close: 20401 for ham and 21432 for spam.

Both email categories have outliers, including extreme outliers above 20,000 characters in length for both classes.

Feature Extraction

In [29]:
vectorizer = TfidfVectorizer(min_df=2, encoding='utf-8', stop_words = 'english', analyzer='word')
    
message_train_Feature = vectorizer.fit_transform(email_train)
message_test_Feature = vectorizer.transform(email_test)

message_val_Feature = vectorizer.transform(email_val)

The purpose of feature extraction is to extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text.

We use 'TfidfVectorizer' from the sklearn library to vectorize our features. TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features, which is equivalent to CountVectorizer followed by TfidfTransformer.

We apply fit on the training dataset and use the transform method on the training dataset, the test dataset and the validation dataset.
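A toy sketch of this fit/transform contract (the documents below are made up): the vectorizer learns its vocabulary only from the data passed to fit_transform, and transform maps new text onto that same vocabulary, so every resulting matrix has the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["gas meter nomination", "cheap pills buy now"]
test_docs = ["meter nomination due now"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary from training text
X_test = vectorizer.transform(test_docs)        # reuses it; unseen words like "due" are ignored

# both matrices share the same feature space (7 vocabulary terms here)
print(X_train.shape, X_test.shape)
```

This is why fitting the vectorizer on the training set only, and merely transforming the test and validation sets, keeps all three feature matrices compatible with one model.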

In [30]:
message_train_Feature.shape
Out[30]:
(2796, 13175)
In [31]:
message_test_Feature.shape
Out[31]:
(1499, 13175)
In [32]:
message_val_Feature.shape
Out[32]:
(699, 13175)

Transform maps each dataset onto the vocabulary learned from the training set, which is why the vectorized Training, Test and Validation sets all have the same number of features (13175 columns) and can be passed to the same model.

Model Selection

We have selected various models such as Random Forest, K-Nearest Neighbor, Decision Tree, Multinomial Naive Bayes and Support Vector Classification to predict the outcomes for the Validation and Test datasets and compare the accuracy scores of all the models, thereby selecting the most accurate model.
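The comparison pattern used below can be sketched on a synthetic dataset (make_classification stands in for the vectorized emails; the models and hyperparameters here are illustrative, not the ones tuned below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the TF-IDF features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),  # GaussianNB here, since these toy features can be negative
}

scores = {}
for name, model in models.items():
    cv = cross_val_score(model, X, y, cv=5)        # 5-fold cross validation
    scores[name] = round(np.mean(cv) * 100, 2)     # mean accuracy as a percentage
    print(name, scores[name], "%")

best = max(scores, key=scores.get)  # model with the highest mean accuracy
```

Each model section below follows this same recipe: fit on the training features, cross-validate on the validation features, then report accuracy on the held-out test set.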

Random Forest Model

In [33]:
#Random Forest Model

random_forest = RandomForestClassifier(max_depth=10, n_estimators = 1000, random_state=0)
random_forest.fit(message_train_Feature,class_train)

# calculating the cross validation score for the Validation Set
rf_cv_result = cross_val_score(random_forest,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(rf_cv_result)

rf_cv_acc = round(np.mean(rf_cv_result)*100,2)
print("Accuracy of Random Forest on the Validation Set :",rf_cv_acc,"%")
Cross Validation Scores for the Validation set:
[0.8        0.78571429 0.81428571 0.77857143 0.76978417]
Accuracy of Random Forest on the Validation Set : 78.97 %

We apply the 'RandomForestClassifier' from the sklearn library with a maximum tree depth of 10 and 1000 estimators.

Firstly, we calculate the cross validation score for the Validation Set with 5 folds.

Taking the mean of those scores, we get the accuracy of the Random Forest model on the Validation Set as 78.97 %.

In [34]:
#Predicting for the Test Dataset
rf_pred = random_forest.predict(message_test_Feature)

print("Classification Report for the Test Data")
print(classification_report(class_test, rf_pred))

rf_test_accuracy = metrics.accuracy_score(rf_pred, class_test)
rf_test_accuracy = round(rf_test_accuracy*100,2)

print("Accuracy of Random Forest Model on the Test Set :",rf_test_accuracy,"%")
Classification Report for the Test Data
              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1065
           1       0.99      0.35      0.52       434

    accuracy                           0.81      1499
   macro avg       0.89      0.68      0.70      1499
weighted avg       0.85      0.81      0.78      1499

Accuracy of Random Forest Model on the Test Set : 81.19 %

The accuracy of the Random Forest model on the Test Set is 81.19 %.

In [35]:
print("CONFUSION MATRIX for Random Forest Model on the Test Data:")
rf_confusion_matrix = confusion_matrix(class_test, rf_pred)
rf_conf_matrix = pd.DataFrame(data = rf_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])    
print(rf_conf_matrix)
CONFUSION MATRIX for Random Forest Model on the Test Data:
             Predicted HAM  Predicted SPAM
Actual HAM            1064               1
Actual SPAM            281             153

From the confusion matrix, it can be observed that only 1 ham email has been predicted as spam, while 281 spam emails in the test data have been predicted as ham.
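The layout sklearn uses can be checked on toy labels (rows are actual classes, columns are predicted classes), which is how the ham/spam counts above are read off:

```python
from sklearn.metrics import confusion_matrix

actual    = [0, 0, 0, 1, 1]  # 0 = ham, 1 = spam
predicted = [0, 0, 1, 0, 1]

cm = confusion_matrix(actual, predicted)
# cm[0][1]: actual ham predicted as spam; cm[1][0]: actual spam predicted as ham
print(cm)  # [[2 1]
           #  [1 1]]
```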

K-Nearest Neighbor Model

In [36]:
#K-Nearest Neighbor Model

knn = KNeighborsClassifier(n_neighbors=3)
knn_model = knn.fit(message_train_Feature,class_train)

# Calculating the cross validation score for the Validation Set
knn_cv_result = cross_val_score(knn_model,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(knn_cv_result)

knn_acc = round(np.mean(knn_cv_result)*100,2)
print("Accuracy of KNN Model on the Validation Set :",knn_acc,"%")
Cross Validation Scores for the Validation set:
[0.93571429 0.96428571 0.95       0.95       0.94964029]
Accuracy of KNN Model on the Validation Set : 94.99 %

We apply the 'KNeighborsClassifier' from the sklearn library with the number of neighbors set to 3.

Firstly, we calculate the cross validation score for the Validation Set with 5 folds.

Taking the mean of those scores, we get the accuracy of the K-Nearest Neighbor Model on the Validation Set as 94.99 %.

In [37]:
#Predicting for the Test Dataset
knn_predict = knn_model.predict(message_test_Feature)

print("Classification Report for the Test Data")
print(classification_report(class_test, knn_predict))

knn_test_accuracy = metrics.accuracy_score(knn_predict, class_test)
knn_test_accuracy = round(knn_test_accuracy*100,2)
print("Accuracy of K-Nearest Neighbor Model on the Test Set :",knn_test_accuracy,"%")
Classification Report for the Test Data
              precision    recall  f1-score   support

           0       1.00      0.48      0.65      1065
           1       0.44      1.00      0.61       434

    accuracy                           0.63      1499
   macro avg       0.72      0.74      0.63      1499
weighted avg       0.84      0.63      0.64      1499

Accuracy of K-Nearest Neighbor Model on the Test Set : 63.24 %

The accuracy of the K-Nearest Neighbor Model on the Test Set is 63.24 %.

In [38]:
print("CONFUSION MATRIX for K-Nearest Neighbor Model on the Test Data:")
knn_confusion_matrix = confusion_matrix(class_test, knn_predict)
knn_conf_matrix = pd.DataFrame(data = knn_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])    
print(knn_conf_matrix)
CONFUSION MATRIX for K-Nearest Neighbor Model on the Test Data:
             Predicted HAM  Predicted SPAM
Actual HAM             514             551
Actual SPAM              0             434

From the confusion matrix, it can be observed that 551 ham emails have been predicted as spam, while no spam emails in the test data have been predicted as ham.

Decision Tree Model

In [39]:
# Applying Decision Tree Model

dec_tree = DecisionTreeClassifier()
dt_model = dec_tree.fit(message_train_Feature,class_train)

# Calculating the cross validation score for the Validation Set
dt_cv_result =cross_val_score(dec_tree,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(dt_cv_result)

dt_acc = round(np.mean(dt_cv_result)*100,2)
print("Accuracy of Decision Tree Model on the Validation Set :",dt_acc,"%")
Cross Validation Scores for the Validation set:
[0.93571429 0.92857143 0.88571429 0.90714286 0.89928058]
Accuracy of Decision Tree Model on the Validation Set : 91.13 %

We apply the 'DecisionTreeClassifier' from sklearn library.

Firstly, we calculate the cross validation score for the Validation Set with 5 folds.

Taking the mean of those scores, we get the accuracy of the Decision Tree Model on the Validation Set as 91.13 %.

In [40]:
#Predicting for the Test Dataset
dt_pred = dt_model.predict(message_test_Feature)

print("Classification Report for the Test Data")
print(classification_report(class_test, dt_pred))

dt_test_accuracy = metrics.accuracy_score(dt_pred, class_test)
dt_test_accuracy = round(dt_test_accuracy*100,2)
print("Accuracy of Decision Tree Model on the Test Set :",dt_test_accuracy,"%")
Classification Report for the Test Data
              precision    recall  f1-score   support

           0       0.97      0.94      0.96      1065
           1       0.87      0.93      0.90       434

    accuracy                           0.94      1499
   macro avg       0.92      0.94      0.93      1499
weighted avg       0.94      0.94      0.94      1499

Accuracy of Decision Tree Model on the Test Set : 93.86 %

The accuracy of the Decision Tree Model on the Test Set is 93.86 %.

In [41]:
print("CONFUSION MATRIX for Decision Tree Model on the Test Data:")
dt_confusion_matrix = confusion_matrix(class_test, dt_pred)
dt_conf_matrix = pd.DataFrame(data = dt_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])    
print(dt_conf_matrix)
CONFUSION MATRIX for Decision Tree Model on the Test Data:
             Predicted HAM  Predicted SPAM
Actual HAM            1004              61
Actual SPAM             31             403

From the confusion matrix, it can be observed that 61 ham emails have been predicted as spam and 31 spam emails in the test data have been predicted as ham.

Multinomial Naive Bayes Model

In [42]:
# Multinomial Naive Bayes Model
mnbc = MultinomialNB()
mnbc.fit(message_train_Feature,class_train)

# Calculating the cross validation score for the Validation Set
mnb_cv_result =cross_val_score(mnbc,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(mnb_cv_result)

mnb_acc = round(np.mean(mnb_cv_result)*100,2)
print("Accuracy of Multinomial Naive Bayes Model on the Validation Set :",mnb_acc,"%")
Cross Validation Scores for the Validation set:
[0.85714286 0.82142857 0.87857143 0.85714286 0.8057554 ]
Accuracy of Multinomial Naive Bayes Model on the Validation Set : 84.4 %

We apply the 'MultinomialNB' from sklearn library.

Firstly, we calculate the cross validation score for the Validation Set with 5 folds.

Taking the mean of those scores, we get the accuracy of the Multinomial Naive Bayes Model on the Validation Set as 84.4 %.

In [43]:
#Predicting for the Test Dataset
MNBC_preds = mnbc.predict(message_test_Feature)

print("Classification Report for the Test Data")
print(classification_report(class_test, MNBC_preds))

mnb_test_accuracy = metrics.accuracy_score(MNBC_preds, class_test)
mnb_test_accuracy = round(mnb_test_accuracy*100,2)
print("Accuracy of Multinomial Naive Bayes Model on the Test Set :",mnb_test_accuracy,"%")
Classification Report for the Test Data
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1065
           1       0.98      0.87      0.93       434

    accuracy                           0.96      1499
   macro avg       0.97      0.93      0.95      1499
weighted avg       0.96      0.96      0.96      1499

Accuracy of Multinomial Naive Bayes Model on the Test Set : 95.93 %

The accuracy of the Multinomial Naive Bayes Model on the Test Set is 95.93 %.

In [44]:
print("CONFUSION MATRIX for Multinomial Naive Bayes Model on the Test Data:")
mnb_confusion_matrix = confusion_matrix(class_test, MNBC_preds)
mnb_conf_matrix = pd.DataFrame(data = mnb_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])    
print(mnb_conf_matrix)
CONFUSION MATRIX for Multinomial Naive Bayes Model on the Test Data:
             Predicted HAM  Predicted SPAM
Actual HAM            1059               6
Actual SPAM             55             379

From the confusion matrix, it can be observed that 6 ham emails have been predicted as spam and 55 spam emails of the test data have been predicted as ham.
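The per-class precision and recall in the classification report above can be recovered directly from this confusion matrix (rows are actual classes, columns are predicted):

```python
# Recomputing spam precision and recall from the confusion matrix above.
import numpy as np

cm = np.array([[1059, 6],    # actual ham:  1059 correct, 6 flagged as spam
               [55, 379]])   # actual spam: 55 missed,    379 caught

spam_precision = cm[1, 1] / cm[:, 1].sum()  # 379 / (379 + 6)
spam_recall = cm[1, 1] / cm[1, :].sum()     # 379 / (379 + 55)
print(round(spam_precision, 2), round(spam_recall, 2))  # 0.98 0.87
```

These match the precision (0.98) and recall (0.87) printed for class 1 in the report.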

Support Vector Classification

In [45]:
# Support Vector Classification Model

svc = SVC()
svc_model = svc.fit(message_train_Feature,class_train)

# Calculating the cross validation score for the Validation Set
svc_cv_result =cross_val_score(svc,message_val_Feature,class_val,cv=5)
print("Cross Validation Scores for the Validation set:")
print(svc_cv_result)

svc_acc = round(np.mean(svc_cv_result)*100,2)
print("Accuracy of Support Vector Classification Model on the Validation Set :",svc_acc,"%")
Cross Validation Scores for the Validation set:
[0.93571429 0.93571429 0.95714286 0.95       0.92805755]
Accuracy of Support Vector Classification Model on the Validation Set : 94.13 %

We apply 'SVC' from the sklearn library.

First, we compute the 5-fold cross-validation scores on the Validation Set.

Taking the mean of those scores, the accuracy of the Support Vector Classification Model on the Validation Set is 94.13 %.
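Note that `SVC()` with no arguments uses the RBF kernel. As a hedged sketch (on synthetic data from `make_classification`, not the Enron features), the same 5-fold scoring can be repeated per kernel to check whether the kernel choice matters:

```python
# Sketch: comparing SVC kernels with the same 5-fold scoring used
# in the notebook. The data here is synthetic, for illustration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
for kernel in ["linear", "rbf"]:  # SVC() defaults to kernel="rbf"
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, round(scores.mean(), 3))
```

The best kernel is data-dependent; the notebook keeps the RBF default.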

In [46]:
#Predicting for the Test Dataset
svc_pred=svc_model.predict(message_test_Feature)

print("Classification Report for the Test Data")
print(classification_report(class_test, svc_pred))

svc_test_accuracy = metrics.accuracy_score(svc_pred, class_test)
svc_test_accuracy = round(svc_test_accuracy*100,2)
print("Accuracy of Support Vector Classification Model on the Test Set :",svc_test_accuracy,"%")
Classification Report for the Test Data
              precision    recall  f1-score   support

           0       1.00      0.98      0.99      1065
           1       0.96      0.99      0.97       434

    accuracy                           0.98      1499
   macro avg       0.98      0.99      0.98      1499
weighted avg       0.99      0.98      0.98      1499

Accuracy of Support Vector Classification Model on the Test Set : 98.47 %

The accuracy of the Support Vector Classification Model on the Test Set is 98.47 %.

In [47]:
print("CONFUSION MATRIX for Support Vector Classification Model on the Test Data:")
svc_confusion_matrix = confusion_matrix(class_test, svc_pred)
svc_conf_matrix = pd.DataFrame(data = svc_confusion_matrix, columns = ['Predicted HAM', 'Predicted SPAM'], index = ['Actual HAM', 'Actual SPAM'])    
print(svc_conf_matrix)
CONFUSION MATRIX for Support Vector Classification Model on the Test Data:
             Predicted HAM  Predicted SPAM
Actual HAM            1046              19
Actual SPAM              4             430

From the confusion matrix, it can be observed that 19 ham emails have been predicted as spam and 4 spam emails of the test data have been predicted as ham.

For the Support Vector Classification Model, a ROC (Receiver Operating Characteristic) curve is plotted: the true positive rate (sensitivity, the proportion of actual positives correctly classified) against the false positive rate (1 − specificity, where specificity is the proportion of actual negatives correctly classified).

In [48]:
svc_disp = plot_roc_curve(svc_model, message_test_Feature, class_test)
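Under the hood, `plot_roc_curve` (deprecated in newer scikit-learn in favour of `RocCurveDisplay.from_estimator`) sweeps a decision threshold over the classifier's scores and records (FPR, TPR) pairs. A minimal sketch with hand-made labels and scores (illustrative values, not the model's output):

```python
# Sketch of the (FPR, TPR) pairs behind a ROC curve, using roc_curve
# on made-up labels and scores (not the SVC model's actual output).
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.6, 0.35, 0.8, 0.9])  # e.g. decision scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(round(auc(fpr, tpr), 2))  # area under the curve -> 0.78
```

The area under this curve (AUC) summarises the plot in one number: 1.0 is a perfect ranking of positives above negatives, 0.5 is chance level.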

Comparison of accuracy of all the classifier models

Creating a Data Frame and storing the Accuracy scores of the Test Set of different models in order to compare.

In [49]:
# creating a data frame to store the Accuracy scores of the Test Set of different models
# (DataFrame.append was removed in pandas 2.0, so the rows are built directly)
model_accuracy = pd.DataFrame({
    "Model": ["Random Forest", "K-Nearest Neighbor", "Decision Tree",
              "Multinomial Naive Bayes", "Support Vector Classification"],
    "Accuracy": [rf_test_accuracy, knn_test_accuracy, dt_test_accuracy,
                 mnb_test_accuracy, svc_test_accuracy]
})

model_accuracy
Out[49]:
Model Accuracy
0 Random Forest 81.19
1 K-Nearest Neighbor 63.24
2 Decision Tree 93.86
3 Multinomial Naive Bayes 95.93
4 Support Vector Classification 98.47

Bar Plot comparison of accuracy of all the classifier models

In [50]:
# Bar plot comparison of accuracy of all the classifier models
# (note: plt is plotly.express here, as aliased in the imports)
accuracy_comparison = plt.bar(model_accuracy, x='Model', y='Accuracy', color='Accuracy',
                              labels={'Model':'Model Name','Accuracy':'Accuracy of the Model'}, height=400)
accuracy_comparison.show()

Conclusion

Among the selected models (Random Forest, K-Nearest Neighbor, Decision Tree, Multinomial Naive Bayes and Support Vector Classification), the Support Vector Classification model fits this dataset best, classifying emails as ham or spam with a test-set accuracy of 98.47 %, the highest of all the compared classifiers.